3/7/23
Due Dates:
Notes:
…will only work if you finished the last set of notes (or open this week’s lab).
When is pivoting your data from wide to long (or long to wide) helpful?
What do you want to know?
Notes from class:
| | |
|---|---|
| Name | nyts_data |
| Number of rows | 95465 |
| Number of columns | 59 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| logical | 43 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| psu | 0 | 1.00 | 5 | 6 | 0 | 431 | 0 |
| stratum | 0 | 1.00 | 3 | 3 | 0 | 16 | 0 |
| Age | 417 | 1.00 | 1 | 3 | 0 | 11 | 0 |
| Sex | 778 | 0.99 | 4 | 6 | 0 | 2 | 0 |
| Grade | 477 | 1.00 | 1 | 14 | 0 | 8 | 0 |
| brand_ecig | 91861 | 0.04 | 3 | 7 | 0 | 7 | 0 |
| Group | 0 | 1.00 | 7 | 23 | 0 | 4 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| ECIGT | 1470 | 0.98 | 0.18 | FAL: 76819, TRU: 17176 |
| ECIGAR | 1757 | 0.98 | 0.15 | FAL: 79724, TRU: 13984 |
| ESLT | 1862 | 0.98 | 0.07 | FAL: 87041, TRU: 6562 |
| EELCIGT | 1715 | 0.98 | 0.26 | FAL: 69417, TRU: 24333 |
| EROLLCIGTS | 2929 | 0.97 | 0.05 | FAL: 88100, TRU: 4436 |
| EFLAVCIGTS | 78469 | 0.18 | 0.05 | FAL: 16126, TRU: 870 |
| EBIDIS | 2940 | 0.97 | 0.01 | FAL: 91369, TRU: 1156 |
| EFLAVCIGAR | 58568 | 0.39 | 0.09 | FAL: 33530, TRU: 3367 |
| EHOOKAH | 2388 | 0.97 | 0.09 | FAL: 84515, TRU: 8562 |
| EPIPE | 2941 | 0.97 | 0.02 | FAL: 90348, TRU: 2176 |
| ESNUS | 2941 | 0.97 | 0.03 | FAL: 89379, TRU: 3145 |
| EDISSOLV | 2939 | 0.97 | 0.01 | FAL: 91423, TRU: 1103 |
| CCIGT | 1840 | 0.98 | 0.05 | FAL: 88747, TRU: 4878 |
| CCIGAR | 2019 | 0.98 | 0.05 | FAL: 88423, TRU: 5023 |
| CSLT | 2173 | 0.98 | 0.03 | FAL: 90496, TRU: 2796 |
| CELCIGT | 1505 | 0.98 | 0.12 | FAL: 82799, TRU: 11161 |
| CROLLCIGTS | 3049 | 0.97 | 0.02 | FAL: 90440, TRU: 1976 |
| CFLAVCIGTS | 78521 | 0.18 | 0.02 | FAL: 16529, TRU: 415 |
| CBIDIS | 3038 | 0.97 | 0.01 | FAL: 91956, TRU: 471 |
| CHOOKAH | 2666 | 0.97 | 0.03 | FAL: 89657, TRU: 3142 |
| CPIPE | 3061 | 0.97 | 0.01 | FAL: 91603, TRU: 801 |
| CSNUS | 3053 | 0.97 | 0.01 | FAL: 91198, TRU: 1214 |
| CDISSOLV | 3050 | 0.97 | 0.01 | FAL: 91938, TRU: 477 |
| menthol | 17711 | 0.81 | 0.06 | FAL: 73305, TRU: 4449 |
| clove_spice | 17711 | 0.81 | 0.01 | FAL: 77360, TRU: 394 |
| fruit | 17711 | 0.81 | 0.07 | FAL: 71945, TRU: 5809 |
| chocolate | 17711 | 0.81 | 0.01 | FAL: 76875, TRU: 879 |
| alcoholic_drink | 17711 | 0.81 | 0.02 | FAL: 76510, TRU: 1244 |
| candy_dessert_sweets | 17711 | 0.81 | 0.05 | FAL: 74188, TRU: 3566 |
| other | 17711 | 0.81 | 0.03 | FAL: 75675, TRU: 2079 |
| EHTP | 78434 | 0.18 | 0.02 | FAL: 16633, TRU: 398 |
| CHTP | 76592 | 0.20 | 0.02 | FAL: 18582, TRU: 291 |
| tobacco_ever | 0 | 1.00 | 0.35 | FAL: 61793, TRU: 33672 |
| tobacco_current | 0 | 1.00 | 0.18 | FAL: 78757, TRU: 16708 |
| ecig_ever | 0 | 1.00 | 0.25 | FAL: 71132, TRU: 24333 |
| ecig_current | 0 | 1.00 | 0.12 | FAL: 84304, TRU: 11161 |
| non_ecig_ever | 0 | 1.00 | 0.27 | FAL: 69674, TRU: 25791 |
| non_ecig_current | 0 | 1.00 | 0.12 | FAL: 84419, TRU: 11046 |
| ecig_only_ever | 0 | 1.00 | 0.05 | FAL: 90308, TRU: 5157 |
| ecig_only_current | 0 | 1.00 | 0.03 | FAL: 92756, TRU: 2709 |
| non_ecig_only_ever | 0 | 1.00 | 0.07 | FAL: 89147, TRU: 6318 |
| non_ecig_only_current | 0 | 1.00 | 0.03 | FAL: 92439, TRU: 3026 |
| no_use | 0 | 1.00 | 0.65 | TRU: 61738, FAL: 33727 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2017.02 | 1.40 | 2015.00 | 2016.00 | 2017.00 | 2018.00 | 2019.00 | ▇▇▇▇▇ |
| finwgt | 0 | 1 | 1421.44 | 1093.13 | 11.15 | 708.52 | 1131.48 | 1754.38 | 6505.08 | ▇▅▁▁▁ |
| tobacco_sum_ever | 0 | 1 | 0.91 | 1.68 | 0.00 | 0.00 | 0.00 | 1.00 | 12.00 | ▇▁▁▁▁ |
| tobacco_sum_current | 0 | 1 | 0.34 | 0.97 | 0.00 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
| ecig_sum_ever | 0 | 1 | 0.25 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
| ecig_sum_current | 0 | 1 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| non_ecig_sum_ever | 0 | 1 | 0.66 | 1.41 | 0.00 | 0.00 | 0.00 | 1.00 | 11.00 | ▇▁▁▁▁ |
| non_ecig_sum_current | 0 | 1 | 0.23 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 | ▇▁▁▁▁ |
| n | 0 | 1 | 19167.48 | 1193.77 | 17711.00 | 17872.00 | 19018.00 | 20189.00 | 20675.00 | ▇▁▃▁▇ |
Note: If you include this in a report, you should also guide the viewer. There’s a lot in there. What do you want your reader to know?
Things discussed in class when looking at skim output:
- brand_ecig has so many missing values because we only have brand information for 2019.
- The mean of a logical (TRUE/FALSE) variable is the proportion of TRUE values.
- The Age min and max values look off because Age is stored as a character (due to the >18 category); for character columns, skim reports min and max as string lengths, not ages.
- n is the number of respondents in each year.

Remember: Any time you’re looking at EDA output, it can be very helpful to think “what overall trends do I see?” and “is there anything weird going on?”
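Two of the points from the skim discussion can be checked on toy vectors. This is a minimal sketch with made-up values (`ever_used` and `age` are hypothetical, not NYTS data): it shows that the mean of a logical vector is the proportion of TRUE values, and that an age column stored as text has string lengths, not numeric ages, behind its min and max.

```r
# Made-up values for illustration only (not NYTS data)
ever_used <- c(TRUE, FALSE, FALSE, TRUE, NA)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
prop_true <- mean(ever_used, na.rm = TRUE)

# Age stored as text because of the ">18" category
age <- c("9", "10", "17", ">18")

# Shortest and longest string lengths in the column
len_range <- range(nchar(age))

# Converting to numeric turns ">18" into NA (R would normally warn here)
age_num <- suppressWarnings(as.numeric(age))

print(prop_true)
print(len_range)
print(age_num)
```

If numeric ages are needed later, the ">18" category has to be handled explicitly, since a plain conversion silently loses it.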
Getting a sense for some of the categorical data
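| Age | >18 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 727 | 50 | 5360 | 13499 | 14613 | 14036 | 13498 | 13205 | 12754 | 7108 | 198 |

| Group | Count |
|---|---|
| Combination of products | 16517 |
| Neither | 61738 |
| Only e-cigarettes | 7866 |
| Only other products | 9344 |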
Remember: If you include this in a report, you’ll also need text to explain what the reader should know/take away from any of these.
(Note: I argued that this was not the best way to display these data…but it’s a good start. If you include it in your report, you will likely want to improve it!)
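The notes do not show the code behind the Age and Group counts, so the following is only one possible way to produce them. It is a sketch on a small made-up data frame (`toy` is hypothetical and stands in for nyts_data), using base R `table()` and dplyr’s `count()`.

```r
library(dplyr)

# Made-up stand-in for nyts_data (values are not real)
toy <- data.frame(
  Age   = c("12", "13", "13", ">18", NA),
  Group = c("Neither", "Neither", "Only e-cigarettes",
            "Neither", "Only other products")
)

# Base R: one count per distinct value; NA is dropped unless useNA is set
age_counts <- table(toy$Age)

# dplyr: one row per value, largest group first
group_counts <- toy |> count(Group, sort = TRUE)

print(age_counts)
print(group_counts)
```

On the real data, the same calls would be applied to nyts_data instead of `toy`.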
Student suggestion: Relative proportion of e-cig use by gender (anecdote: spent cartridges were found in the male bathrooms but not the female ones in high school; hypothesis: use is higher among males than females, but we want to see)
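One way to start on this suggestion is to compute the proportion of respondents who have ever used e-cigarettes within each value of Sex. The sketch below uses a made-up data frame (`toy` is hypothetical); the column names Sex and ecig_ever match the skim output, but the values do not come from the survey, and the proportions are unweighted.

```r
library(dplyr)

# Made-up stand-in for nyts_data (values are not real)
toy <- data.frame(
  Sex       = c("Male", "Male", "Female", "Female", "Female", NA),
  ecig_ever = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

ecig_by_sex <- toy |>
  filter(!is.na(Sex)) |>                     # drop respondents with missing Sex
  group_by(Sex) |>
  summarize(prop_ecig_ever = mean(ecig_ever, na.rm = TRUE),  # share of TRUE
            n = n())                         # group size, for context

print(ecig_by_sex)
```

Reporting the group size next to each proportion makes it easier to judge how much weight a difference between groups deserves.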
Student suggestion: Distribution of most popular brands (all brands)
A tibble: 12 × 3, grouped by year (5 groups); columns are year (dbl), brand_ecig (chr), n (int)

| year | brand_ecig | n |
|---|---|---|
| 2015 | NA | 17711 |
| 2016 | NA | 20675 |
| 2017 | NA | 17872 |
| 2018 | NA | 20189 |
| 2019 | Blu | 111 |
| 2019 | JUUL | 2028 |
| 2019 | Logic | 36 |
| 2019 | MarkTen | 32 |
| 2019 | NJOY | 44 |
| 2019 | Other | 1253 |
| 2019 | Vuse | 100 |
| 2019 | NA | 15414 |
The above helps us remember that we only have brand information from 2019.
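The code that produced the brand counts is not shown in these notes. Because the output is grouped by year, a plausible reconstruction is `group_by(year)` followed by `count(brand_ecig)`; treat that as an assumption. The sketch runs it on a made-up data frame (`toy` is hypothetical).

```r
library(dplyr)

# Made-up stand-in for nyts_data (values are not real)
toy <- data.frame(
  year       = c(2018, 2018, 2019, 2019, 2019),
  brand_ecig = c(NA, NA, "JUUL", "JUUL", "Other")
)

brand_counts <- toy |>
  group_by(year) |>
  count(brand_ecig)   # one row per year and brand; NA gets its own row

print(brand_counts)
```

Years with no brand information show up as a single NA row, which is the pattern seen for 2015 through 2018.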
Note: If including in a report, you’ll want titles, cleaner axis names, and likely to sort the x-axis values, but for your first pass EDA where you’re just trying to understand the data, this is sufficient.
Student suggestion: Distribution of most popular flavors (all flavors)
This one is more complicated b/c flavor data are across multiple columns…going to leave this for the analysis set of notes
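As a preview of why this needs reshaping, here is a minimal sketch of one possible approach, not necessarily the one the analysis notes will use. It stacks the flavor columns into a single column with `pivot_longer()` and then counts TRUE values per flavor. The data frame `toy` is hypothetical; the column names are taken from the skim output, and the values are made up.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the flavor columns of nyts_data
toy <- data.frame(
  menthol   = c(TRUE, FALSE, NA, FALSE),
  fruit     = c(TRUE, TRUE, NA, FALSE),
  chocolate = c(FALSE, FALSE, NA, FALSE)
)

flavor_counts <- toy |>
  # wide to long: one row per respondent and flavor
  pivot_longer(everything(), names_to = "flavor", values_to = "used") |>
  group_by(flavor) |>
  summarize(n_true = sum(used, na.rm = TRUE)) |>  # number reporting each flavor
  arrange(desc(n_true))                           # most popular first

print(flavor_counts)
```

With the real data, `everything()` would be replaced by the specific flavor columns so the other variables are left alone.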
Student suggestion: year on the x-axis; y is frequency of use; some measurement of use (e-cig, tobacco, …)
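library(tidyverse)  # dplyr, tidyr, ggplot2 (skip if already loaded)

nyts_data |>
  group_by(year) |>
  # proportion of respondents each year who have ever used each product type
  summarize(mean_ecig_ever = mean(ecig_ever, na.rm = TRUE),
            mean_tobacco_ever = mean(tobacco_ever, na.rm = TRUE)) |>
  # wide to long: one row per year and variable, so each variable gets its own line
  pivot_longer(-year, names_to = "variable", values_to = "values") |>
  ggplot(aes(x = year, y = values, group = variable, linetype = variable)) +
  geom_line()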
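To see what the pivot step in this pipeline does to the shape of the data, here is a small sketch on a made-up summary table (`wide` is hypothetical and its proportions are not real results). It goes wide to long with `pivot_longer()` and back again with `pivot_wider()`.

```r
library(tidyr)

# Made-up yearly summary: one column per measure (wide format)
wide <- data.frame(
  year              = c(2015, 2016),
  mean_ecig_ever    = c(0.20, 0.25),
  mean_tobacco_ever = c(0.30, 0.35)
)

# Long format: one row per year and measure, which is what ggplot needs
# to map the measure name to linetype
long <- wide |>
  pivot_longer(-year, names_to = "variable", values_to = "values")

# And back to one column per measure
back_to_wide <- long |>
  pivot_wider(names_from = variable, values_from = values)

print(long)
print(back_to_wide)
```

Long format is helpful when a plot or summary should treat the column names as a variable; wide format is often easier to read as a table.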